
Monday, October 21, 2024

The impact of PGO, LTO and more for MySQL on a small CPU: insert benchmark

This post has results for the impact of PGO, LTO and other compiler options for MySQL using the Insert Benchmark. I previously shared results like this for sysbench. The PGO builds here were created from profiles captured during sysbench and I am curious to understand whether the benefit from those builds extends to other workloads. 

tl;dr

  • PGO builds created using sysbench also help other workloads like the Insert Benchmark
  • results here are similar to results from sysbench
    • gcc does slightly better than clang without LTO or PGO
    • clang does slightly better than gcc with PGO
    • clang does more than slightly better than gcc with PGO + LTO
    • gcc with PGO has performance similar to gcc with PGO + LTO

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here.

All work (builds & tests) is done on a server that uses Ubuntu 22.04.5. The gcc version is 11.4.0 and the clang version is 14.0.0-1ubuntu1.1.

The CMake command lines for all of the builds are here. All builds use CMAKE_BUILD_TYPE=Release. The builds are listed below; the name of the CMake command line file per build is cmk.80.$val where $val is the build name:
  • rel_o2nofp
    • uses gcc, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel_o2nofp_clang
    • uses clang, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel
    • uses gcc, -O3, doesn't work great for flamegraphs
  • rel_clang
    • uses clang, -O3, doesn't work great for flamegraphs
  • rel_o2nofp_lto
    • like rel_o2nofp but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_lto_clang
    • like rel_o2nofp_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto
    • like rel but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto_clang
    • like rel_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_native
    • like rel_o2nofp but adds -march=native -mtune=native
  • rel_o2nofp_native_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native
  • rel_native
    • like rel but adds -march=native -mtune=native
  • rel_native_clang
    • like rel_clang but adds -march=native -mtune=native
  • rel_o2nofp_native_lto
    • like rel_o2nofp but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_o2nofp_native_lto_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto
    • like rel but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto_clang
    • like rel_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_pgo_use
    • like rel but adds PGO via -fprofile-use. The profiles for PGO were captured while running sysbench with an instrumented (-fprofile-generate) build.
  • rel_pgo_use_clang
    • like rel_clang but adds PGO via -fprofile-use. The profiles for PGO were captured while running sysbench with an instrumented build.
  • rel_lto_pgo_use
    • like rel but adds PGO via -fprofile-use and -DWITH_LTO=ON. The profiles for PGO were captured while running sysbench with an instrumented build.
  • rel_lto_pgo_use_clang
    • like rel_clang but adds PGO via -fprofile-use and -DWITH_LTO=ON. The profiles for PGO were captured while running sysbench with an instrumented build.
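As a sketch of how such a build is configured: the exact CMake command lines are in the files linked above, and the flag spellings below are illustrative assumptions for gcc (clang uses -fprofile-instr-generate plus llvm-profdata merge instead), not the exact options used for these builds.

```shell
# Phase 1: configure an instrumented build (illustrative flags only).
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_LTO=ON \
      -DCMAKE_C_FLAGS="-fprofile-generate" \
      -DCMAKE_CXX_FLAGS="-fprofile-generate" /path/to/mysql-8.0.28
make -j8

# Phase 2: run the training workload (sysbench here) so the
# instrumented mysqld writes .gcda profile files.

# Phase 3: reconfigure with the captured profiles applied.
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_LTO=ON \
      -DCMAKE_C_FLAGS="-fprofile-use -fprofile-correction" \
      -DCMAKE_CXX_FLAGS="-fprofile-use -fprofile-correction" /path/to/mysql-8.0.28
make -j8
```

With -fprofile-correction gcc tolerates the slightly inconsistent counters that multi-threaded servers like mysqld produce during training.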
Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700U CPU (8 cores, SMT disabled), 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700u) is described as a laptop class CPU. The server is configured to use the performance frequency governor and acpi-cpufreq scaling driver.

The Benchmark

The benchmark is explained here and is run with 1 client and 1 table. The database is cached by InnoDB. The benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
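The SLA rule for the qr/qp steps above amounts to a one-line check: a step passes when the measured background insert rate matches the target. A minimal sketch (the function name here is mine, not the benchmark client's code):

```shell
# Sketch of the SLA rule described above: a qr/qp step sustains its
# target when the measured background insert rate >= the target rate.
sla_check() {  # usage: sla_check <measured_rate> <target_rate>
  awk -v m="$1" -v t="$2" 'BEGIN {
    if (m >= t) print "ok"; else print "SLA failure"
  }'
}

sla_check 100 100   # prints: ok
sla_check 98 100    # prints: SLA failure
```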
Results: overview

The performance report is here.

The summary section in the performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case is MySQL 8.0.28 with the rel_o2nofp build.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
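The relative QPS arithmetic and the color thresholds above are simple enough to state in a few lines of shell (a sketch of the arithmetic only; the helper names are mine, not the reporting scripts used for the post):

```shell
# relative QPS = QPS for $me / QPS for $base, rounded to 2 digits
relative_qps() {
  awk -v me="$1" -v base="$2" 'BEGIN { printf "%.2f\n", me / base }'
}

# color per the post: red <= 0.95, green >= 1.05, grey in between
qps_color() {
  awk -v q="$1" 'BEGIN {
    if (q <= 0.95)      print "red"
    else if (q >= 1.05) print "green"
    else                print "grey"
  }'
}

qps_color "$(relative_qps 920 1000)"   # prints: red (an 8% regression)
```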

Normally I summarize performance here but I think the summary in the tl;dr above is sufficient.

Saturday, September 14, 2024

MySQL and Postgres vs the Insert Benchmark on a large server

This has benchmark results for MySQL and Postgres vs the Insert Benchmark on a large server. My intent is to document how performance changes over time, not to start a Postgres vs MySQL argument.

MySQL has accumulated large regressions from 5.6 to 8.0 that are obvious on low-concurrency benchmarks. While they are less obvious on high-concurrency benchmarks, and there have been significant improvements to make MySQL 8 better at high-concurrency, the regressions that hurt low-concurrency results also reduce throughput at high-concurrency.

tl;dr

  • For Postgres 17rc1 vs 15.8
    • Performance is mostly the same
  • For MySQL 8.0.39 vs 5.6.51
    • Writes are much faster in 8.0.39
    • Read throughput is mixed: 8.0.39 is slower than, similar to, or faster than 5.6.51 depending on the context
  • For MySQL vs Postgres
    • MySQL is faster on point queries
    • Postgres is faster on writes and range queries
Builds, configuration and hardware

I compiled from source:
  • Postgres versions 15.8 and 17rc1 from source using -O2 -fno-omit-frame-pointer
  • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE=Release and -O2 -fno-omit-frame-pointer
The server is an ax162-s from Hetzner with an AMD CPU (48 cores, SMT disabled) and 128G of RAM. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

The Postgres configuration file is here.
The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

The Benchmark

The benchmark is explained here and is run with 20 clients and a table per client with three workloads:
  • cached - database fits in the Postgres buffer pool
  • less IO-bound - database is larger than memory with 64M rows per table
  • more IO-bound - database is larger than memory with 200M rows per table
The benchmark steps are:

  • l.i0
    • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 20 for cached, 64 for less IO-bound and 200 for more IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for less/more IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for less/more IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance reports are here.
The summary section in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is MySQL 5.6.51 for MySQL vs Postgres, MySQL 5.6.51 for only MySQL and Postgres 15.8 for only Postgres.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Results: cached

From the summary for only Postgres:
  • Write-heavy steps are up to 11% faster in 17rc1 vs 15.8
  • Read-heavy steps are unchanged in 17rc1 vs 15.8
From the summary for only MySQL:
  • Write-heavy steps are ~2.5X to ~4X faster in 8.0.39 vs 5.6.51
  • Read-heavy steps are between 12% and 27% slower in 8.0.39 vs 5.6.51 
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
  • MySQL 8.0.39 and Postgres 17rc1 get similar QPS on point queries
Results: less IO-bound

From the summary for only Postgres:
  • Write-heavy steps are up to 13% faster in 17rc1 vs 15.8, except for l.i2 which suffers from intermittently high CPU overhead from get_actual_variable_range
  • Read-heavy steps are mostly similar between 17rc1 and 15.8
From the summary for only MySQL:
  • Write-heavy steps are ~1.9X to ~4X faster in 8.0.39 vs 5.6.51
  • Read-heavy steps are up to 24% slower in 8.0.39 vs 5.6.51
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
    • The range queries are covering and the secondary indexes are mostly cached so the tests are CPU bound for both Postgres and MySQL (see rpq here)
  • MySQL 8.0.39 is faster on point queries
    • MySQL does ~0.1 reads/query while Postgres does ~3 (see rpq here). The table is index organized with MySQL and heap organized with Postgres. So with InnoDB the point queries just access the PK index while with Postgres they first access the PK index and then the heap table to get all of the columns.
Results: more IO-bound

From the summary for only Postgres:
  • Write-heavy steps have similar throughput between 17rc1 and 15.8 ignoring index create (l.x) and variance from l.i2 (problems with get_actual_variable_range)
  • Read-heavy steps are up to 6% faster in 17rc1 vs 15.8
From the summary for only MySQL:
  • Write-heavy steps are up to ~4X faster in 8.0.39 vs 5.6.51
  • Point queries are up to ~3X faster in 8.0.39 vs 5.6.51
  • Range queries are up to 27% slower in 8.0.39 vs 5.6.51
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
    • See the less IO-bound section above
    • MySQL does more read IO per range query (see rpq here)
  • MySQL 8.0.39 is faster on point queries
    • See the less IO-bound section above
    • Postgres does ~2X more read IO per point query (see rpq here)

Tuesday, August 27, 2024

Postgres 17beta3 vs the Insert Benchmark on a medium server: looking good

This has benchmark results for Postgres 12 through 17beta3 using the Insert Benchmark and a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large. A recent result up to Postgres 17beta2 from the same server is here.

This work was done by Small Datum LLC.

    tl;dr

    • 17beta3 looks (mostly) good
    • There might be regressions in 17 beta1, beta2 and beta3 on the l.i1 and l.i2 benchmark steps related to get_actual_variable_range
    Builds, configuration and hardware

    I compiled Postgres versions 12.19, 12.20, 13.15, 13.16, 14.12, 14.13, 15.7, 15.8, 16.3, 16.4, 17beta1, 17beta2 and 17beta3 from source using -O2 -fno-omit-frame-pointer.

    The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

    The configuration file is in the pg* subdirectories here with the name conf.diff.cx10a_c24r64.

    The Benchmark

    The benchmark is explained here and is run with 8 clients and a table per client with two workloads:
    • cached - database fits in the Postgres buffer pool
    • IO-bound - database is larger than memory and there are many reads from disk
    The benchmark steps are:

    • l.i0
      • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 10 for cached and 128 for IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for IO-bound.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here for cached and for IO-bound.

    The summary (for cached and for IO-bound) in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is Postgres 12.19 and $me is each of the later versions through 17beta3. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: cached

    Summary:
    • Results for 17beta3 are similar to 16.4 with one exception. Results for the l.i2 benchmark step in 17 beta1, beta2 and beta3 are between 5% and 8% worse than in 16.4. I assume this is another problem related to get_actual_variable_range
      • This is confusing because 17 beta3 does better than the base case on l.i1 and the l.i1 workload is similar to l.i2 except there are more rows modified per transaction (so the optimizer overhead is amortized over more work).
      • For Postgres 17beta I see a slight increase in CPU per operation (cpupq) and a slight reduction in context switches per operation (cspq) in the metrics section relative to Postgres 16.4.
    From the summary the relative throughput per benchmark step is below. The base case is Postgres 12.19.
    • l.i0
      • relative QPS is 1.02 in PG 16.4
      • relative QPS is 0.98 in PG 17 beta3
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.06, 1.00 in PG 16.4
      • relative QPS is 1.09, 0.92 in PG 17 beta3
    • qr100, qr500, qr1000
      • relative QPS is 1.03, 1.04, 1.04 in PG 16.4
      • relative QPS is 1.04, 1.05, 1.08 in PG 17 beta3
    • qp100, qp500, qp1000
      • relative QPS is 0.99, 0.99, 0.99 in PG 16.4
      • relative QPS is 0.98, 0.98, 0.98 in PG 17 beta3
    Results: IO-bound

    Summary:
    • Results for 17beta3 are similar to 16.4 with one exception. Results for the l.i1 and l.i2 benchmark steps in 17 beta1, beta2 and beta3 are mostly much worse. I assume this is another problem related to get_actual_variable_range
      • This is confusing because 17 beta3 does better than the base case on l.i1 and the l.i1 workload is similar to l.i2 except there are more rows modified per transaction (so the optimizer overhead is amortized over more work).
      • For Postgres 17beta I see an increase in CPU per operation (cpupq) in the metrics section relative to Postgres 16.4.
    From the summary the relative throughput per benchmark step is below. The base case is Postgres 12.19.
    • l.i0
      • relative QPS is 1.00 in PG 16.4
      • relative QPS is 0.96 in PG 17 beta3
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.01, 1.45 in PG 16.4
      • relative QPS is 0.88, 1.01 in PG 17 beta3
    • qr100, qr500, qr1000
      • relative QPS is 0.98, 0.98, 0.98 in PG 16.4
      • relative QPS is 1.00, 1.00, 0.99 in PG 17 beta3
    • qp100, qp500, qp1000
      • relative QPS is 1.01, 1.01, 1.01 in PG 16.4
      • relative QPS is 1.01, 1.01, 1.01 in PG 17 beta3

    Monday, July 15, 2024

    MySQL 8.0.38 vs the Insert Benchmark on a medium server

    This has benchmark results for MySQL 8.0.38 and a few other 8.0 releases using the Insert Benchmark and a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large. A result for Postgres 17 on the same server is here.

    tl;dr

    • Performance for the write-heavy (l.i0, l.i1, l.i2) and point-query tests is ~5% worse after 8.0.28 and that has yet to be fixed. AFAIK, one problem is from changes related to function inlining for InnoDB, see PS-8822 and bug 111538.
    • Throughput for range queries (qr100, qr500, qr1000) improved by 5% after 8.0.28. I don't have a bug report for a fixed bug.
    Builds, configuration and hardware

    I compiled from source MySQL versions 8.0.28, 8.0.32, 8.0.36, 8.0.37 and 8.0.38.

    The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

    The my.cnf file is here.

    The Benchmark

    The benchmark is explained here and is run with 8 clients and a table per client with two workloads:
    • cached - database fits in the InnoDB buffer pool
    • IO-bound - database is larger than memory and there are many reads from disk
    The benchmark steps are:

    • l.i0
      • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 10 for cached and 128 for IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for IO-bound.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here for cached and for IO-bound.
      The summary (for cached and for IO-bound) in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

      Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is MySQL 8.0.28 and the versions for $me are 8.0.32, 8.0.36, 8.0.37 and 8.0.38. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
      • insert/s for l.i0, l.i1, l.i2
      • indexed rows/s for l.x
      • range queries/s for qr100, qr500, qr1000
      • point queries/s for qp100, qp500, qp1000
      Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

      Results: cached

      Summary:
      • Throughput for the write-heavy (l.i0, l.i1, l.i2) and point-query tests regressed (5% to 10% slower) after 8.0.28 and that has yet to be fixed. AFAIK, one problem is from changes related to function inlining for InnoDB, see PS-8822 and bug 111538.
      • Throughput for range queries (qr100, qr500, qr1000) improved by 5% after 8.0.28. I don't have a bug report for a fixed bug.
      • Tables with vmstat metrics show that CPU/operation (cpupq) changes to match the throughput changes
      From the summary, the relative throughput for 8.0.38 (vs 8.0.28) per benchmark step is:
      • l.i0
        • relative QPS is 0.96 in MySQL 8.0.38
      • l.x - I ignore this for now
      • l.i1, l.i2
        • relative QPS is 0.92, 0.94 in MySQL 8.0.38
      • qr100, qr500, qr1000
        • relative QPS is 1.04, 1.04, 1.04 in MySQL 8.0.38
      • qp100, qp500, qp1000
        • relative QPS is 0.94, 0.95, 0.96 in MySQL 8.0.38
      Results: IO-bound

      Summary:
      • Throughput for the write-heavy (l.i0, l.i1, l.i2) and point-query tests regressed (~5%) after 8.0.28 and that has yet to be fixed. AFAIK, one problem is from changes related to function inlining for InnoDB, see PS-8822 and bug 111538.
      • Throughput for range queries (qr100, qr500, qr1000) improved by ~10% after 8.0.28. I don't have a bug report for a fixed bug.
      • Tables with vmstat metrics show that CPU/operation (cpupq) changes to match the throughput changes
      From the summary, the relative throughput for 8.0.38 (vs 8.0.28) per benchmark step is:
      • l.i0
        • relative QPS is 0.94 in MySQL 8.0.38
      • l.x - I ignore this for now
      • l.i1, l.i2
        • relative QPS is 0.95, 0.94 in MySQL 8.0.38
      • qr100, qr500, qr1000
        • relative QPS is 1.08, 1.10, 1.13 in MySQL 8.0.38
      • qp100, qp500, qp1000
        • relative QPS is 0.83, 0.99, 0.96 in MySQL 8.0.38

      Friday, July 12, 2024

      Postgres 17beta2 vs the Insert Benchmark on a medium server: looking good

      This has benchmark results for Postgres 17beta2 using the Insert Benchmark and a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large. A result for sysbench from the same server is here. Results for the Insert Benchmark comparing 16.3 and 17beta1 on a large server are here and here.

      This work was done by Small Datum LLC.

        tl;dr

        • 17beta2 looks good but results on public cloud HW might have more noise
        • I will repeat tests on my home servers soon
        Builds, configuration and hardware

        I compiled Postgres versions 16.3, 17beta1 and 17beta2 from source.

        The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

        The configuration file is here.

        The Benchmark

        The benchmark is explained here and is run with 8 clients and a table per client with two workloads:
        • cached - database fits in the Postgres buffer pool
        • IO-bound - database is larger than memory and there are many reads from disk
        The benchmark steps are:

        • l.i0
          • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 10 for cached and 128 for IO-bound.
        • l.x
          • create 3 secondary indexes per table. There is one connection per client.
        • l.i1
          • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for IO-bound.
        • l.i2
          • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for IO-bound.
          • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
        • qr100
          • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
        • qp100
          • like qr100 except uses point queries on the PK index
        • qr500
          • like qr100 but the insert and delete rates are increased from 100/s to 500/s
        • qp500
          • like qp100 but the insert and delete rates are increased from 100/s to 500/s
        • qr1000
          • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
        • qp1000
          • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
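The write-heavy steps above (l.i1, l.i2) pair an insert connection with a delete connection that must run at the same rate. A minimal sketch of that pairing, with the database calls stubbed out; all names here are illustrative, not from the actual benchmark client:

```python
# Sketch of the l.i1 pattern: one connection inserts, a second deletes at the
# same rate, 50 rows per transaction. insert_rows/delete_rows are stubs.
import threading

ROWS_PER_TXN = 50
TOTAL_ROWS = 1000            # the real step uses 40M (cached) or 4M (IO-bound)
inserted = 0
lock = threading.Condition()

def insert_rows(n): pass     # stub: INSERT n rows in one transaction
def delete_rows(n): pass     # stub: DELETE n rows in one transaction

def inserter():
    global inserted
    while True:
        with lock:
            if inserted >= TOTAL_ROWS:
                lock.notify_all()
                return
            inserted += ROWS_PER_TXN
            lock.notify_all()
        insert_rows(ROWS_PER_TXN)

def deleter():
    deleted = 0
    while deleted < TOTAL_ROWS:
        with lock:
            # pace deletes to the insert rate: never run ahead of the inserts
            while deleted >= inserted:
                lock.wait()
        delete_rows(ROWS_PER_TXN)
        deleted += ROWS_PER_TXN

t1 = threading.Thread(target=inserter)
t2 = threading.Thread(target=deleter)
t1.start(); t2.start(); t1.join(); t2.join()
print(inserted)  # 1000
```

Because the step runs for a fixed number of inserts rather than a fixed time, the run time of l.i1 and l.i2 varies with the insert rate, which is why their results are reported as inserts/s.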
        Results: overview

        The performance reports are here for cached and for IO-bound.
          The summary (for cached and for IO-bound) in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

          Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is Postgres 16.3 and the versions for $me are 17beta1 and 17beta2. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
          • insert/s for l.i0, l.i1, l.i2
          • indexed rows/s for l.x
          • range queries/s for qr100, qr500, qr1000
          • point queries/s for qp100, qp500, qp1000
          Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
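The relative QPS computation and the color buckets can be expressed directly. A small sketch; the helper names are mine, not from the benchmark scripts:

```python
# Relative QPS and the red/green/grey buckets described above.

def relative_qps(qps_me: float, qps_base: float) -> float:
    """Relative QPS = (QPS for $me / QPS for $base)."""
    return qps_me / qps_base

def color_bucket(rel: float) -> str:
    """red for <= 0.95, green for >= 1.05, grey for values in between."""
    if rel <= 0.95:
        return "red"
    if rel >= 1.05:
        return "green"
    return "grey"

print(color_bucket(relative_qps(980.0, 1000.0)))  # grey (0.98)
```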

          Results: cached

          Summary:
          • Results here are less positive than the results on a large server I have at home. Perhaps there is more variance in HW perf in the public cloud. For now I will claim this detects no regressions and repeat tests on my home servers.
          • The worst results are from l.i2 and are caused by the CPU overhead of get_actual_variable_range, a problem I hope is fixed soon.
          • From vmstat metrics for l.i1 and l.i2 I don't see changes in CPU overhead (cpupq is CPU /operation) or context switches (cspq is context switches /operation). I do see a reduction in CPU utilization (cpups is the sum of vmstat us and sy columns).
          • From ps (output not shared here) I see that the connections doing DELETE statements used ~7% more CPU in 17beta2 vs 16.3. When DELETEs take longer to process, the INSERT connection is idle for more time, which explains why CPU utilization is lower with 17beta2 than with 16.3.
          From the summary the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 0.97 in PG 17beta1
            • relative QPS is 0.98 in PG 17beta2
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 1.03, 0.95 in PG 17beta1
            • relative QPS is 1.04, 0.93 in PG 17beta2
          • qr100, qr500, qr1000
            • relative QPS is 1.01, 1.03, 1.05 in PG 17beta1
            • relative QPS is 1.01, 1.01, 1.04 in PG 17beta2
          • qp100, qp500, qp1000
            • relative QPS is 1.01, 1.00, 1.00 in PG 17beta1
            • relative QPS is 1.01, 1.01, 1.00 in PG 17beta2
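The vmstat-derived metrics mentioned above (cpupq, cspq, cpups) can be computed from vmstat samples. A sketch; the exact normalization used for cpupq is my assumption, and the sample numbers are made up:

```python
# Derive the per-operation metrics from vmstat-style counters.

def vmstat_metrics(us, sy, cs_per_sec, ops_per_sec, num_cpus):
    cpups = us + sy  # CPU utilization: sum of the vmstat us and sy columns
    # cpupq: CPU per operation, here as microseconds of CPU time per op
    cpupq = (cpups / 100.0) * num_cpus * 1_000_000 / ops_per_sec
    cspq = cs_per_sec / ops_per_sec  # context switches per operation
    return cpups, cpupq, cspq

cpups, cpupq, cspq = vmstat_metrics(us=20, sy=5, cs_per_sec=50_000,
                                    ops_per_sec=10_000, num_cpus=8)
print(cpups, cpupq, cspq)  # 25 200.0 5.0
```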
          Results: IO-bound

          Summary:
          • Results here are less positive than the results on a large server I have at home. Perhaps there is more variance in HW perf in the public cloud. For now I will claim this detects no regressions and repeat tests on my home servers.
          • The worst results are from l.i2 and are caused by the CPU overhead of get_actual_variable_range, a problem I hope is fixed soon.
          • From vmstat metrics for l.i1 and l.i2 I see changes in CPU overhead (cpupq is CPU /operation) but not in context switches (cspq is context switches /operation).
          • From ps (output not shared here) I see that the connections doing DELETE statements used ~20% more CPU in 17beta2 vs 16.3. When DELETEs take longer to process, the INSERT connection is idle for more time, which explains why CPU utilization is lower with 17beta2 than with 16.3.
          From the summary the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 0.96 in PG 17beta1
            • relative QPS is 0.96 in PG 17beta2
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 1.02, 0.79 in PG 17beta1
            • relative QPS is 0.98, 0.78 in PG 17beta2
          • qr100, qr500, qr1000
            • relative QPS is 1.00, 1.00, 0.98 in PG 17beta1
            • relative QPS is 1.01, 1.00, 0.99 in PG 17beta2
          • qp100, qp500, qp1000
            • relative QPS is 1.00, 0.99, 0.99 in PG 17beta1
            • relative QPS is 1.00, 0.99, 0.89 in PG 17beta2

          Tuesday, May 14, 2024

          The Insert Benchmark: MariaDB, MySQL, new small server, cached workload, some concurrency

          This post has results for the Insert Benchmark on a small server with a cached workload. The goal is to compare MariaDB and MySQL and see whether they get better or worse over time. The results here are from the newest small servers in my test cluster -- an ASUS PN53.

          This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.

          The workload here has low concurrency (1 or 4 clients) and the database is cached. The results might be different when the workload is IO-bound or has more concurrency. Results were recently shared from tests run on an older small server. 

          tl;dr

          • Modern MariaDB (11.4.1) is faster than modern MySQL (8.0.36) on all benchmark steps. MariaDB is almost 2X faster in the worst case for MySQL.
          • MariaDB 11.4.1 compared to 10.2.44 is up to 11% slower on the initial load (l.i0), up to 26% faster on the write heavy steps (l.i1, l.i2) and up to 9% slower on the read-write steps (qr*, qp*). It has done a great job at avoiding regressions.
          • There are large regressions from MySQL 5.6.51 to 5.7.44 and then again from 5.7.44 to 8.0.36. In the worst case 5.6.51 is almost 2X faster than 8.0.36 and in most cases 8.0.36 is 15% to 30% slower than 5.6.51. 
          Build + Configuration

          This report has results for InnoDB from:
          • MySQL - versions 5.6.51, 5.7.44 and 8.0.36
          • MariaDB - versions 10.2.44, 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.4.1. Versions 10.2, 10.3, 10.4, 10.5, 10.6 and 10.11 are the most recent LTS releases and 11.4 will be the next LTS release.
          All of the my.cnf files are here. I tried to use similar configurations across releases, but that isn't always possible. And even when it was possible I favored setting fewer entries, especially for options where the default value changes between releases. The my.cnf files for MariaDB and MySQL can be split into two groups:
          • my.cnf.cz11a_c8r32 and my.cnf.cz11b_c8r32
            • these use innodb_purge_threads=4
            • the cz11a[pur1]_c8r32 versions use innodb_flush_method =O_DIRECT_NO_FSYNC while the cz11b[pur1]_c8r32 versions use =O_DIRECT (or its equivalent). All DBMS versions support cz11a_c8r32 except for MariaDB 11.4.1 which removed support for O_DIRECT_NO_FSYNC and uses the equivalent of O_DIRECT.
          • my.cnf.cz11apur1_c8r32 and my.cnf.cz11bpur1_c8r32
            • these use innodb_purge_threads=1 because on some of the older small servers, performance is worse with =4
            • see above for the discussion of cz11a vs cz11b
          My previous post has more details on the challenges of trying to use similar my.cnf files across MySQL and MariaDB.
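The difference between the config groups described above can be sketched as a my.cnf fragment. Only the two settings named in the text are shown; the rest of each real my.cnf file is omitted:

```ini
# my.cnf.cz11a_c8r32 (cz11b_c8r32 differs only in the flush method)
[mysqld]
innodb_purge_threads=4
innodb_flush_method=O_DIRECT_NO_FSYNC   # cz11b uses O_DIRECT (or its equivalent)

# my.cnf.cz11apur1_c8r32 / my.cnf.cz11bpur1_c8r32: as above except
# innodb_purge_threads=1
```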

          The Benchmark

          The benchmark is run with 1 client and then 4 clients, a cached workload and a table per client. It is explained here.

          The test server was named v7 here and is an ASUS PN53. It has 8 cores, 32G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

          The benchmark steps are:

          • l.i0
            • insert X rows per table into Y tables in PK order. Each table has a PK index but no secondary indexes. There is one connection per client. With 1 client the values of X and Y are 50M and 1, while with 4 clients they are X=12M and Y=4.
          • l.x
            • create 3 secondary indexes per table. There is one connection per client.
          • l.i1
            • use 2 connections/client. One inserts 40M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
          • l.i2
            • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted.
            • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
          • qr100
            • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
          • qp100
            • like qr100 except uses point queries on the PK index
          • qr500
            • like qr100 but the insert and delete rates are increased from 100/s to 500/s
          • qp500
            • like qp100 but the insert and delete rates are increased from 100/s to 500/s
          • qr1000
            • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
          • qp1000
            • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
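The background connections in the qr*/qp* steps above must hold a fixed target rate (100, 500 or 1000 operations/s). A sketch of one way to do that pacing, and of the SLA check; the names and the scheduling scheme here are illustrative, not the benchmark's actual client code:

```python
# Run a stubbed operation at a fixed target rate and report the achieved rate.
import time

def run_at_rate(target_per_sec, duration_sec, do_op):
    """Run do_op at target_per_sec for duration_sec; return the achieved rate."""
    interval = 1.0 / target_per_sec
    start = time.monotonic()
    done = 0
    while time.monotonic() - start < duration_sec:
        do_op()  # e.g. one INSERT or one DELETE
        done += 1
        # Sleep until the next slot on an absolute schedule so one slow
        # operation does not push all later operations back.
        delay = (start + done * interval) - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return done / (time.monotonic() - start)

achieved = run_at_rate(100, 1, lambda: None)
sla_ok = achieved >= 0.99 * 100  # the SLA check described above
```

The absolute schedule is the important design choice: it keeps the long-run rate at the target even when individual operations are slow, so a failure to sustain the rate indicates the server, not the client, fell behind.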
          Results

          The performance reports are here for:
          The summary in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

          Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
          • insert/s for l.i0, l.i1, l.i2
          • indexed rows/s for l.x
          • range queries/s for qr100, qr500, qr1000
          • point queries/s for qp100, qp500, qp1000
          Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

          Results: MariaDB vs MySQL

          MariaDB is much better at avoiding regressions than MySQL, and as a result it is now faster than MySQL on all of the benchmark steps. The results are similar between innodb_purge_threads =1 and =4 so I focus on =4.

          In both cases (1 client, 4 clients) the base case is MariaDB 10.11.7 with the cz11a_c8r32 config (ma101107_rel.cz11a_c8r32). It is compared with:
          • MariaDB 11.4.1 with the cz11b_c8r32 config (ma110401_rel.cz11b_c8r32)
          • MySQL 8.0.36 with the cz11a_c8r32 config (my8036_rel.cz11a_c8r32)
          From the summary for 1 client the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 1.02 in MariaDB 11.4.1
            • relative QPS is 0.62 in MySQL 8.0.36
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 0.92, 0.97 in MariaDB 11.4.1
            • relative QPS is 0.86, 0.78 in MySQL 8.0.36
          • qr100, qr500, qr1000
            • relative QPS is 1.00, 1.02, 1.01 in MariaDB 11.4.1
            • relative QPS is 0.98, 1.00, 0.99 in MySQL 8.0.36
          • qp100, qp500, qp1000
            • relative QPS is 1.01, 1.00, 1.01 in MariaDB 11.4.1
            • relative QPS is 0.84, 0.84, 0.85 in MySQL 8.0.36
          From the summary for 4 clients the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 1.00 in MariaDB 11.4.1
            • relative QPS is 0.80 in MySQL 8.0.36
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 0.83, 0.97 in MariaDB 11.4.1
            • relative QPS is 0.50, 0.52 in MySQL 8.0.36
          • qr100, qr500, qr1000
            • relative QPS is 1.00, 0.99, 1.00 in MariaDB 11.4.1
            • relative QPS is 0.98, 0.98, 0.98 in MySQL 8.0.36
          • qp100, qp500, qp1000
            • relative QPS is 1.00, 1.00, 1.00 in MariaDB 11.4.1
            • relative QPS is 0.85, 0.84, 0.85 in MySQL 8.0.36
          Results: MariaDB

          MariaDB 11.4.1 compared to 10.2.44 is up to 11% slower on the initial load (l.i0), up to 26% faster on the write heavy steps (l.i1, l.i2) and up to 9% slower on the read-write steps (qr*, qp*). It has done a great job at avoiding regressions.

          The results are similar between innodb_purge_threads =1 and =4 so I focus on =4.

          In both cases (1 client, 4 clients) the base case is MariaDB 10.2.44 with the cz11a_c8r32 config (ma100244_rel.cz11a_c8r32). It is compared with:
          • MariaDB 10.11.7 with the cz11a_c8r32 config (ma101107_rel.cz11a_c8r32)
          • MariaDB 11.4.1 with the cz11b_c8r32 config (ma110401_rel.cz11b_c8r32)
          From the summary for 1 client the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 0.93 in MariaDB 10.11.7
            • relative QPS is 0.95 in MariaDB 11.4.1
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 1.13, 1.26 in MariaDB 10.11.7
            • relative QPS is 1.05, 1.22 in MariaDB 11.4.1
          • qr100, qr500, qr1000
            • relative QPS is 0.92, 0.91, 0.93 in MariaDB 10.11.7
            • relative QPS is 0.92, 0.93, 0.93 in MariaDB 11.4.1
          • qp100, qp500, qp1000
            • relative QPS is 0.93, 0.94, 0.94 in MariaDB 10.11.7
            • relative QPS is 0.93, 0.94, 0.95 in MariaDB 11.4.1
          From the summary for 4 clients the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 0.89 in MariaDB 10.11.7
            • relative QPS is 0.89 in MariaDB 11.4.1
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 1.32, 1.22 in MariaDB 10.11.7
            • relative QPS is 1.10, 1.18 in MariaDB 11.4.1
          • qr100, qr500, qr1000
            • relative QPS is 0.94, 0.98, 0.99 in MariaDB 10.11.7
            • relative QPS is 0.94, 0.97, 0.99 in MariaDB 11.4.1
          • qp100, qp500, qp1000
            • relative QPS is 0.93, 0.96, 0.98 in MariaDB 10.11.7
            • relative QPS is 0.93, 0.96, 0.97 in MariaDB 11.4.1
          Results: MySQL

          There are large regressions from MySQL 5.6.51 to 5.7.44 and then again from 5.7.44 to 8.0.36. In the worst case 5.6.51 is almost 2X faster than 8.0.36 and in most cases 8.0.36 is 15% to 30% slower than 5.6.51. The results are similar between innodb_purge_threads =1 and =4 so I focus on =4.

          In both cases (1 client, 4 clients) the base case is MySQL 5.6.51 with the cz11a_c8r32 config (my5651_rel.cz11a_c8r32). It is compared with:
          • MySQL 5.7.44 with the cz11a_c8r32 config (my5744_rel.cz11a_c8r32)
          • MySQL 8.0.36 with the cz11a_c8r32 config (my8036_rel.cz11a_c8r32)
          From the summary for 1 client the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 0.86 in MySQL 5.7.44
            • relative QPS is 0.56 in MySQL 8.0.36
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 1.06, 1.00 in MySQL 5.7.44
            • relative QPS is 0.98, 0.84 in MySQL 8.0.36
          • qr100, qr500, qr1000
            • relative QPS is 0.84, 0.85, 0.83 in MySQL 5.7.44
            • relative QPS is 0.75, 0.76, 0.75 in MySQL 8.0.36
          • qp100, qp500, qp1000
            • relative QPS is 0.89, 0.87, 0.87 in MySQL 5.7.44
            • relative QPS is 0.71, 0.70, 0.70 in MySQL 8.0.36
          From the summary for 4 clients the relative throughput per benchmark step is:
          • l.i0
            • relative QPS is 1.05 in MySQL 5.7.44
            • relative QPS is 0.79 in MySQL 8.0.36
          • l.x - I ignore this for now
          • l.i1, l.i2
            • relative QPS is 1.30, 1.31 in MySQL 5.7.44
            • relative QPS is 0.80, 0.77 in MySQL 8.0.36
          • qr100, qr500, qr1000
            • relative QPS is 0.85, 0.86, 0.86 in MySQL 5.7.44
            • relative QPS is 0.76, 0.76, 0.77 in MySQL 8.0.36
          • qp100, qp500, qp1000
            • relative QPS is 0.87, 0.87, 0.87 in MySQL 5.7.44
            • relative QPS is 0.69, 0.69, 0.70 in MySQL 8.0.36


          Sunday, April 21, 2024

          The Insert Benchmark: MariaDB, MySQL, small server, cached workload, some concurrency

          This post has results for the Insert Benchmark on a small server with a cached workload. The goal is to compare MariaDB and MySQL.

          This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.

          The workload here has some concurrency (4 clients) and the database is cached. The results might be different when the workload is IO-bound or has more concurrency. Results were recently shared for a workload with low concurrency (1 client).

          The results here are similar to the results on the low-concurrency benchmark.

          tl;dr

          • Modern MariaDB (11.4.1) is faster than modern MySQL (8.0.36) on all benchmark steps except for qr* (range query) and l.x (create index) where they have similar performance.
          • Modern MariaDB (11.4.1) was at most 12% slower than older MariaDB (10.2.44). MariaDB has done a great job of avoiding performance regressions over time.
          • There are significant performance regressions from older MySQL (5.6) to modern MySQL (8.0)
          • Configuring MariaDB 10.4 to make it closer to 10.5 (only one InnoDB buffer pool, only one redo log file) reduces throughput by ~5% on several benchmark steps

          Build + Configuration

          This report has results for InnoDB from:
          • MySQL - versions 5.6.51, 5.7.44 and 8.0.36
          • MariaDB - versions 10.2.44, 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.4.1. Versions 10.2, 10.3, 10.4, 10.5, 10.6 and 10.11 are the most recent LTS releases and 11.4 will be the next LTS release.
          All of the my.cnf files are here. I tried to use similar configurations across releases, but that isn't always possible. And even when it was possible I favored setting fewer entries, especially for options where the default value changes between releases.

          I started with the my.cnf.cz11a_bee config and then began to make small changes. For all configs I set these values to limit the size of the history list which also keeps the database from growing larger than expected. I rarely did this in the past.
          innodb_max_purge_lag=500000
          innodb_max_purge_lag_delay=1000000

          Some of the changes were challenging when trying to make things comparable. 
          • the InnoDB change buffer was removed in MariaDB 11.4. 
            • I disable it in all my.cnf files for all MariaDB versions except for the my.cnf.cz11d and my.cnf.cz11d1 configs.
            • I don't disable it for the MySQL configs named my.cnf.cz11[abc]_bee but I do disable it for the my.cnf.cz11d_bee config used by MySQL. The result is that for MariaDB the my.cnf.cz11d_bee config enables the change buffer while for MySQL it disables it. Sorry for the confusion.
          • innodb_buffer_pool_instances was removed in MariaDB 10.5 (assume it is =1). 
            • I don't set it to =1 in the my.cnf.cz11[abc]_bee configs for MariaDB 10.2, 10.3, 10.4
          • innodb_flush_method was removed in MariaDB 11.4 and there is a new way to configure this.
            • In 11.4.1 there is an equivalent of =O_DIRECT but not of =O_DIRECT_NO_FSYNC
          For MariaDB the typical my.cnf files were:
          • my.cnf.cz11a_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC
          • Two configs that are only used with MariaDB 10.4 to make it closer to 10.5+
            • my.cnf.cz11abpi1_bee - like my.cnf.cz11a_bee except sets innodb_buffer_pool_instances and innodb_page_cleaners to =1. This config makes MariaDB 10.4 behave more like 10.5+
            • my.cnf.cz11aredo1_bee - like my.cnf.cz11a_bee except uses one big InnoDB redo log instead of 10 smaller ones
          • my.cnf.cz11b_bee - uses innodb_flush_method=O_DIRECT
          • my.cnf.cz11c_bee - uses innodb_flush_method=fsync
          • my.cnf.cz11d_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC and enables the InnoDB change buffer
          For MySQL the typical my.cnf files were:
          • my.cnf.cz11a_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC
          • my.cnf.cz11b_bee - uses innodb_flush_method=O_DIRECT
          • my.cnf.cz11c_bee - uses innodb_flush_method=fsync
          • my.cnf.cz11d_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC and disables the InnoDB change buffer. Note the my.cnf.cz11[abc]_bee configs for MySQL enabled it. This is the opposite of what is done for MariaDB.
          The Benchmark

          The benchmark is run with 4 clients (some concurrency), a cached workload and 4 tables (table per client). It is explained here. The initial load (l.i0) inserts 32M rows and the database fits in memory.

          The test server was named v4 here and is a Beelink SER4. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

          The benchmark steps are:

          • l.i0
            • insert 8 million rows per table in PK order (32M rows in total). Each table has a PK index but no secondary indexes. There is one connection per client.
          • l.x
            • create 3 secondary indexes per table. There is one connection per client.
          • l.i1
            • use 2 connections/client. One inserts 40M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
          • l.i2
            • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted.
            • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
          • qr100
            • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
          • qp100
            • like qr100 except uses point queries on the PK index
          • qr500
            • like qr100 but the insert and delete rates are increased from 100/s to 500/s
          • qp500
            • like qp100 but the insert and delete rates are increased from 100/s to 500/s
          • qr1000
            • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
          • qp1000
            • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
          Results

          The performance reports are here for:
          The summary in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

          Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
          • insert/s for l.i0, l.i1, l.i2
          • indexed rows/s for l.x
          • range queries/s for qr100, qr500, qr1000
          • point queries/s for qp100, qp500, qp1000
          Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

          Results: MariaDB vs MySQL

          Modern MariaDB (11.4.1) is faster than modern MySQL (8.0.36) on all benchmark steps except for qr* (range query) and l.x (create index) where they have similar performance. This matches the low concurrency results.

          • The base case is MariaDB 10.11.7 with the cz11a_bee config (ma101107_rel.cz11a_bee). It is compared with
            • MariaDB 11.4.1 with the cz11b_bee config (ma110401_rel.cz11b_bee)
            • MySQL 8.0.36 with the cz11a_bee config (my8036_rel.cz11a_bee)
          • Relative throughput per benchmark step
            • l.i0
              • relative QPS is 0.99 in MariaDB 11.4.1
              • relative QPS is 0.75 in MySQL 8.0.36
            • l.x - I ignore this for now
            • l.i1, l.i2
              • relative QPS is 0.88, 0.98 in MariaDB 11.4.1
              • relative QPS is 0.52, 0.54 in MySQL 8.0.36
            • qr100, qr500, qr1000
              • relative QPS is 0.99, 1.01, 1.00 in MariaDB 11.4.1
              • relative QPS is 1.08, 1.05, 1.06 in MySQL 8.0.36
            • qp100, qp500, qp1000
              • relative QPS is 0.98, 1.00, 1.00 in MariaDB 11.4.1
              • relative QPS is 0.77, 0.79, 0.77 in MySQL 8.0.36
          Results: MariaDB

          Modern MariaDB (11.4.1) was at most 12% slower than older MariaDB (10.2.44). This matches the low concurrency results. MariaDB has done a great job of avoiding performance regressions over time.

          • The base case is MariaDB 10.2.44 with the cz11a_bee config (ma100244_rel.cz11a_bee). It is compared with more recent LTS releases from 10.3, 10.4, 10.5, 10.6, 10.11 and 11.4.
          • Throughput per benchmark step for 11.4.1 relative to 10.2.44
            • l.i0
              • relative QPS is 0.88 in MariaDB 11.4.1
            • l.x - I ignore this for now
            • l.i1, l.i2
              • relative QPS is 0.99, 1.09 in MariaDB 11.4.1
            • qr100, qr500, qr1000
              • relative QPS is 0.88, 0.93, 0.92 in MariaDB 11.4.1
            • qp100, qp500, qp1000
              • relative QPS is 0.93, 0.98, 1.00 in MariaDB 11.4.1
          Results: MariaDB 10.4

          • The base case is MariaDB 10.4.33 with the cz11a_bee config (ma100433_rel.cz11a_bee).
          • It is compared with MariaDB 10.4.33 using the cz11abpi1_bee and cz11aredo1_bee configs
          • MariaDB with the cz11abpi1_bee config does worse on the l.i1 and l.i2 benchmark steps
            • There might be more contention on the one buffer pool instance
            • From the metrics with the l.i1 benchmark step I see larger values for context switches per query (cspq) and CPU per query (cpupq). I also see a lower CPU utilization (cpups) and write back rate (wmbps is storage MB/s written)
          • MariaDB with the cz11aredo1_bee config does worse on the qp* benchmark steps
            • From the metrics with the qp* benchmark steps I see a larger value for CPU per query (cpupq)
          • Throughput per benchmark step relative to the base case
            • l.i0
              • relative QPS is 0.99 with cz11abpi1_bee
              • relative QPS is 1.01 with cz11aredo1_bee
            • l.x - I ignore this for now
            • l.i1, l.i2
              • relative QPS is 0.90, 0.94 with cz11abpi1_bee
              • relative QPS is 1.00, 1.00 with cz11aredo1_bee
            • qr100, qr500, qr1000
              • relative QPS is 1.00, 1.00, 1.01 with cz11abpi1_bee
              • relative QPS is 0.99, 0.98, 0.99 with cz11aredo1_bee
            • qp100, qp500, qp1000
              • relative QPS is 0.99, 1.01, 1.00 with cz11abpi1_bee
              • relative QPS is 0.95, 0.95, 0.95 with cz11aredo1_bee
          Results: MariaDB 10.11

          • The base case is MariaDB 10.11.7 with the cz11a_bee config (ma101107_rel.cz11a_bee) and 10.11.7 releases with other configs are compared to it.
          • The results here are similar to the low-concurrency results except that the results here for the cz11b_bee and cz11c_bee configs are much worse for several of the write-heavy steps than they are in the low-concurrency test. 
          Results: MariaDB 11.4
          • The base case is MariaDB 11.4.1 with the cz11b_bee config (ma110401.cz11b_bee) and 11.4.1 with the cz11c_bee config is compared to it. Note that 11.4.1 does not support the equivalent of O_DIRECT_NO_FSYNC for innodb_flush_method.
          • For the cz11c_bee config
            • Performance for the write heavy steps (l.i0, l.i1, l.i2) is ~10% worse than the base case. This issue doesn't repeat on the low-concurrency results.
            • Performance for qp100 is ~7% worse than the base case. This is similar to the low-concurrency results.
          Results: MySQL

          There are significant performance regressions from MySQL 5.6 to 8.0 and the results here are similar to the low-concurrency results.

          • The base case is MySQL 5.6.51 with the cz11a_bee config (my5651_rel.cz11a_bee) and it is compared to MySQL 5.7.44 and 8.0.36.
          • Relative throughput per benchmark step
            • l.i0
              • relative QPS is 0.84 in MySQL 5.7.44
              • relative QPS is 0.60 in MySQL 8.0.36
            • l.x - I ignore this for now
            • l.i1, l.i2
              • relative QPS is 1.33, 1.17 in MySQL 5.7.44
              • relative QPS is 0.78, 0.72 in MySQL 8.0.36
            • qr100, qr500, qr1000
              • relative QPS is 0.79, 0.80, 0.80 in MySQL 5.7.44
              • relative QPS is 0.69, 0.68, 0.69 in MySQL 8.0.36
            • qp100, qp500, qp1000
              • relative QPS is 0.82, 0.86, 0.81 in MySQL 5.7.44
              • relative QPS is 0.64, 0.67, 0.64 in MySQL 8.0.36
